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Exploring heritability of complex traits is a central focus of statistical genetics. Among 
various previously proposed methods to estimate heritability, variance component 
methods are advantageous when estimating heritability using markers. Due to the 
high-dimensional nature of data obtained from genome-wide association studies (GWAS) 
in which genetic architecture is often unknown, the most appropriate heritability estimator 
model is often unclear. The Haseman-Elston (HE) regression is a variance component 
method that was initially only proposed for linkage studies. However, this study presents 
a theoretical basis for a modified HE that models linkage disequilibrium for a quantitative 
trait, and consequently can be used for GWAS. After replacing identical by descent (IBD) 
scores with identity by state (IBS) scores, we applied the IBS-based HE regression to 
single-marker association studies (scenario I) and estimated the variance component 
using multiple markers (scenario II). In scenario II, we discuss the circumstances in 
which the HE regression and the mixed linear model are equivalent; the disparity 
between these two methods is observed when a covariance component exists for 
the additive variance. When we extended the IBS-based HE regression to case-control 
studies in a subsequent simulation study, we found that it provided a nearly unbiased 
estimate of heritability, more precise than that estimated via the mixed linear model. 
Thus, for the case-control scenario, the HE regression is preferable. GEnetic Analysis 
Repository (GEAR; http://sourceforge.net/p/gbchen/wiki/GEAR/) software implemented 
the HE regression method and is freely available. 

Keywords: Haseman-Elston regression, GWAS, identity by state, variance component, missing heritability, case- 
control, mixed linear model, REML 



INTRODUCTION 

So-called "missing heritability" can occur due to various reasons, 
such as small sample size, underrepresented variant spectrum, 
experimental design, and improper methodological assumptions 
(Manolio et al., 2009). Because of the high-dimensional nature 
of genome-wide association study (GWAS) data, in which the 
number of markers (M) is far greater than the number of indi- 
viduals {N), estimating heritability is difficult. For instance, if the 
statistical power is insufficient, variants associated with a small 
effect may not be captured under a stringent p-value threshold 
(~ 10^^). This obstacle can be partially bypassed by implement- 
ing the mixed linear model, which uses the genetic relationship 
between individuals estimated from single nucleotide polymor- 
phism (SNP) markers in lieu of fitting hundreds of thousands 
of markers together (Yang et al., 2010). Nevertheless, it has 
been recently disputed how an estimator should be adjusted 
under genetic architecture. Speed et al. (2012) suggested using 
a weighted genetic relationship matrix under different genetic 
architecture, which is often unknown. As demonstrated in large- 
scale empirical data studies (Lee et al, 2013), Speed's ad-hoc 
weighing method depends on the genetic architecture and does 
not often outperform plain weight methods upon comparison. As 
the genetic architecture, such as the relationship between variant 



frequency and variant effect, is often unknown, criteria should be 
established to justify the model used to estimate heritability. 

For GWAS, as many samples are collected to study diseases, 
many studies eventually adopt a case-control design. Due to 
ascertainment in case-control studies, scale transformation is 
necessary. Without scale transformation, the heritability on the 
observed scale can be greater than 1, rendering the estimated 
heritability meaningless, as it is not representative of its heritabil- 
ity on the liability scale, which is more interpretable (Falconer, 
1966) for disease data. An equation (Lee et al., 2011) that trans- 
forms heritability from the observed scale to the liability scale has 
been proposed (as the equation was indexed as the 23rd equation 
in Sang Hong Lee's paper, it is henceforth denoted as Hong23) 
and was investigated under the infinitesimal model, for which 
the number of casual loci is infinite. However, in practice, disease 
loci are reasonably limited for many diseases (Yang et al., 2011), 
which raises the question of whether or not Hong23 works well 
for mixed linear model estimates if the infinitesimal model does 
not hold. 

All of the above concerns are related to the heritability esti- 
mated via variance component methods implemented thus far 
in mixed linear models. The Haseman-Elston (HE) regression 
is a prestigious method for estimating variance components 
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(Haseman and Elston, 1972). The HE regression, a well-known 
tool for linkage studies that uses identity by descent (IBD) (Lynch 
and Walsh, 1998; Hill and Weir, 2011) scores, however, seems a 
rusty weapon in the genomics analysis armory of the GWAS era. 
This is because the HE regression relies on relatedness measured 
on IBD but not identity by state (IBS). Although IBS has been 
employed for linkage analysis, such as under affected-pedigree- 
member design (Lange, 1986b; Weeks and Lange, 1988; Bishop 
and Williamson, 1990), its performance is largely dependent on 
marker polymorphisms and may cause high false positives when 
ad-hoc weighting functions or incorrect frequencies are adopted. 
As an underrepresented concept of the linkage era, IBS is neither 
well-adapted to linkage studies nor employed in the original HE 
regression framework. 

Taken together, the following questions remain. 

(1) Can the HE regression be applied to the IBS content such as 
for GWAS? If the answer is affirmative, what is the theoretical 
basis and the genetic interpretation in this new context? 

(2) An equilibrium has been established between the HE regres- 
sion and the variance component method (Sham et al., 
2002) for linkage studies. Does this equilibrium stand for 
high-dimensional data such as GWAS data and what are its 
assumptions? 

(3) If the IBS-based HE regression is applied to case-control 
studies, can it estimate heritability better than the mixed 
linear models? 

(4) Given the recent dispute regarding heritability estima- 
tion of complex traits, can HE regression provide further 
justification? 

Recently, a new theory using like-standardized IBS has paved 
another route to assess genetic relatedness (Ritland, 1996; Powell 
et al., 2010; Yang et al, 2010) between unrelated individuals 
(conventional sense). The IBS score resembles the conventional 
IBD score (Powell et al., 2010), which raises the question of 
whether this IBS score can be used in the HE regression for unre- 
lated individuals. In this study, by replacing the IBD scores with 
standardized IBS scores, we used the HE regression to conduct 
association studies for GWAS data. Assuming random mating, 
biallelic loci, and additive genetic effects only on the genetic archi- 
tecture of quantitative trait loci (QTLs) underlying a complex 
trait, this report establishes the theoretical basis for using the HE 
regression for GWAS. Two generic scenarios were investigated, 
and their regression coefficients were derived and have geneti- 
cally meaningful interpretations. In scenario I, the IBS score was 
assessed via a marker that was in linkage disequilibrium (LD) with 
a QTL. This enabled the HE regression to be a tool for single- 
marker GWAS. In scenario II, IBS score was assessed on multiple 
markers, each of which could be in LD with multiple QTLs. This 
allowed the HE regression to be used to estimate the variance 
component tagged by markers. 

The second scenario has implications for estimation of her- 
itability for complex trait using whole genome-wide markers 
together, similar to the mixed linear model (Yang et al., 2010; 
Lee et al, 2011). Using an analytical method that establishes 
the equivalence between the IBS-based HE regression and the 



mixed linear model, a simple criterion is proposed to justify the 
estimates in this study. A similar equivalence between the HE 
regression and the variance component analysis with the mixed 
linear model was determined in the context of linkage analy- 
sis (Sham and Purcell, 2001). In this study, their equivalence is 
established under the context of GWAS, and the conditions for 
equivalence are explored analytically as well as in silico. After 
extending the established HE regression into case-control scenar- 
ios, we demonstrated that Hong23 fits the estimate from the HE 
regression better than that from the mixed linear model. 

Furthermore, as the IBS-based HE regression uses least 
squares, it is advantageous in its computational efficiency and is N 
{N is the sample size) times faster than the mixed linear model. In 
order to facilitate the application, the HE regression algorithm for 
GWAS data was implemented in Java software, GEnetic Analysis 
Repository (GEAR), which is freely available online. 

As the first half of this report is focused on establishing the 
mathematical basis of the IBS-based HE regression, many mathe- 
matical symbols are introduced (Table 1). In the text below, the 
HE regression is the IBS-based HE regression unless explicitly 
noted otherwise. 

THEORY OF THE IBS-BASED HE REGRESSION 

For an individual, the phenotype is denoted as y;, which fol- 
lows the normal distribution of N{fx,y, a^), and the genotype is 
Xi = [x,i, x,2, . . . , x,m], in which M is the number of markers. 
For the i* individual, the genotype at the fc^'' locus is which 
counts the reference alleles at the fc* locus. The reference allele is 
denoted as Ak and the alternative is ajt. The frequency of Ajt is pk 
and the frequency of cij- is qi^. gi^ is the set of possible genotypes, say 
{%fli> Aj^^i^, AjtAjt), at the ¥^ locus. Consequently, fljtfli;, Aj-aj-, 
and AjtAj: are coded as 0, 1, and 2, respectively. After standardiza- 
tion, X, is expressed as s, = , '^'^Z^ , ■ ■ ■ , ^'"^^^'^ . For 
Xfj:, given a genotype of ajcfl/t; J^kC^k-y and AjtAjt, their standardized 
scores are ^=£t, -%=£t and ^==, respectively. The additive 

•JT-Pklk -J^Pklk -J^Pklk 

effect of the QTL is denoted as Pi. Throughout the study, we 
assume a polygenic model with L QTLs. 

THE IBS-BASED HE REGRESSION 

Haseman and Elston (1972) proposed a linear model, Yij = ji + 
hitij + Cij, for detecting linkage between a marker and a QTL in 
a full-sib design. Y,j represents the squared difference between a 
pair of full sibs, and JTy is the proportion of IBD at an observed 
marker locus; fi is the intercept of the regression, b is the regres- 
sion coefficient, and Cy is the residual. The mathematical expecta- 
tion of the regression coefficient is fo = — 2 (1 — 2c)^ cr^, in which 
c is the recombination fraction between the marker locus and the 
QTL, and is the additive genetic variance of the QTL. 

Now consider a sample consisting of N unrelated individuals. 
If the phenotype for the i'^ individual is y,-, we can modify the HE 
original regression as below 

Y,j = ii + bQ.j + e,j (1) 

in which Yij = (yi—yjf' represents the squared difference, 
Q.ij is the measure of the genetic relatedness of a pair of 
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Table 1 | Notation definitions. 



Notation Definition 

Pi^ and Allele frequencies of A and a at the /c* locus. A is the reference allele. 

Df;/ Linkage disequilibriunn of a pair of loci, D^i = faja; — QkQi, in which fg^a, is the frequency of haplotype ajja;. 

rj;/ and r(;/ = p(a/|a/t) and = p(/\/|/\/t), the conditional probabilities of the two coupling haplotypes, af;a/ and Aj^A^. 

Pki Pki = _jp^qlpiqi • the Pearson's correlation between a pair of biallelic loci, k and /. 

The mean of the squared correlation between any marker pair, including the marker with itself. This can be estimated from the genotype 

data. 

p?, The mean of the squared correlation between any a marker and a QTL. 



A 



M The number of markers. 

Me The effective number of markers. See the text and Supplementary Note II for definition. 

X/ X,- = [x/i , x,-2, X/3, . . . , X/m], genotype scores, a vector. It counts the reference allele number for each locus. 

Qk The genotype set for the fc* locus, such as = {Sk^k- ^kSk, -^fr-^fc). Analogously, for a QTL, gi^ = {Q/tSfr. QkQk^ QkQk}- 

Si Standardized genotype scores for the /* individual, a vector. S; = , ^C2S, . . . , "'""^P" 

L The number of QTLs. 

W Sample size. 

AT M=^^. 

Af' Af' = _ (qi -I- 1 ) jn which d is the number of parameters in the HE regression. 

// The phenotype of the /* individual. 

Yij The square of the phenotype difference between the /* and the /* individuals. 

Qij The genetic relatedness between the /"' and the j* individuals. See the text for definition. 

Pi The additive effect of the /* QTL. 

tr| Total additive variance. 

tr^ Narrow-sense heritability. 

ai The square-root of the additive variance of the /* QTL, ct; = ^/'^PiQiPi- 

Hong23 Expressed as tij = t?^ '^'^^J ~ pj , is the heritability on the liability scale, /i^ is the heritability on the observed scale directly estimated 
based on the case-control data, K is the disease prevalence, P is the proportion of cases in the data, and z is the height of the standard 
normal distribution in which the prevalence is located (Lee et al., 2011 ). 

Subscript Subscripts / and j are used to indicate individuals, and k and / are used to indicate loci, which can be either markers or QTLs. 



individuals, and is the residual. Given N unrelated indi- 
viduals, there are A/" = iV x (N — 1) such individual pairs. Q,ij 
is the similarity score between a pair of individuals based on 
the IBS, as recently proposed (Powell et al, 2010; Yang et al., 
2010). 

For the linear model in Equation (1), the expectation of 

the regression coefficient is E{b) = "-^j^^^-y-- var{Uij) is the 

variance of the genetic relatedness. cov (Yy, ^ij) = E (f2y7y) — 
E (Qij) E (Yij) = E(QyYij) because £(f2y) = 0 [see the def- 
inition for f2y in section The Derivation of var(Q,ij) and 
Effective Number of Markers {Me)], and E{p.ijYij) = 

^k= 1 ^=cik(igk^ycjk<^gkSikSjk [E {yi\x,k) - E {yj\xjk)fp {x,k) pixjk) is 
the mathematical expectation of the joint distribution for f2y 
and Yij. In order to derive var{Q.ij) and cov (Yij, S2y), we need 
to introduce the haplotype distribution of a biallelic loci pair 
(section Haplotypes of a Biallelic Loci Pair). When the haplotype 
is constructed on a pair of markers, it leads to the derivation 
of var{Q.tj) [section The Derivation of var(Q.fj) and Effective 
Number of Markers (Me)]; when the haplotype is constructed 
for a marker and a QTL, it leads to E{yi\xik), the conditional 
expectation of the phenotype based on a marker [section The 
Derivation of E (yi\xik) ] . 



DERIVATIONS OF var(a,j) AND E (K, |x,fc) 
Haplotypes of a biallelic loci pair 

For a pair of biallelic loci, there are four haplotype phases, and 
their conditional probabilities are as summarized in Table SI. 
rjt; = p(ai\ai^) and R^i =p(A;|Ajt) are defined as the conditional 
probabilities of the haplotypes in the coupling phases, such as aj^ai 
and AkAi, respectively; 1 — r^i and 1 — Rki represent the condi- 
tional probabilities of the alleles in their repulsion phases, such 
as a^Ai and Aj:fl;, respectively Da = fA^Ai - PkPl, in which /a^a, is 
the frequency of the haplotype A^A;; D^-; is the covariance between 
the loci, quantifying the LD between them. 

The correlation of a pair of biallelic loci can be expressed as a 
2x2 correlation 

Dki 



Pkl 



^Pkqkpiqi 



(2) 



p^l is often used to parameterize the LD of a loci pair (Hill and 
Robertson, 1968). For more LD parameterization, please refer to 
Devlin and Risch (1995) and Wray (2005). 

Once the conditional probabilities of the haplotypes are 
defined, it is straightforward to obtain the joint probabilities 
of the genotypes for a pair of loci. For example, under random 
mating, the probability of the genotype Aj;Ai:A;A; 
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is p (AkAkAiAi) = p (AkAk \AiAi) p (AkAk) = p (A; \Ak) p {Ak) 
p {Ai\Ak) p (Ak) = pIrIi- Analogously, this leads to the joint 
probabilities of the other eight two-locus genotypes (See 
Table 2). 

The derivation of var (fl,y) and effective number of markers (Me) 

For a sample consisting of unrelated individuals, their pairwise 
genetic relationships, say additive genetic relationships, can be 
estimated with genetic markers, such as SNP markers (Powell 
et al, 2010; Yang et al., 2010). The genetic relatedness f2y between 
the r'^ individual and the individual is measured by the dot 
product of their standardized genotypes and then divided by the 
number of markers. 

^ ^!i:^^J_5^M (X,7: - 2pk) (x,k - 2pk) 

The possible relatedness scores of a pair of individuals are sum- 
marized in Table 3A, totaling nine products. After combining 
the same score values, there are seven unique terms as in 
Table 3B. It is easy to derive that E (f2y) = 0 and var (^2,-,) = 
^E^jE^jCov(Qyi, ^2y;), in which cov {^yk, ^2iji) = 
E (Qijk^iji) - E (^ijk) E {Qijk)=E{Uijki2yi) because E (f2y.) = 0. 

Q,,j is informative in revealing hidden relatedness. For exam- 
ple, for the duplicated individual in the sample, E(Q,,j) = 1; for 
first-degree relatives, -E(f2y) = 0.5; for second-degree relatives, 
E (f2y) = 0.25. Consequently, it can control the entry of samples 
that are under the expected cutoff for relatedness. 

After some additional algebra (see Supplementary Note I), we 
arrived at the following equation. 

cov {Uijk, niji) = pIi (4) 

When the /c* locus and the locus are in linkage equilibrium, 
cov (^ijk, ^iji) = 0; when the fc* locus and the locus are at the 
same locus, cov(^^ijk, ^iji) = 1. 

var{a,) = i,E^L,E;l,p,^, = 1 + ^^i:^ .zf^^ kdi 0) 

The distribution of p^j varies with pk and pi (Wray, 2005). We 
can also interpret var (f2y) as the mean of the squared Pearson's 
correlation between the markers along the genome, denoted 
asp2^. 



For simplicity of the following derivation, the concept of an 
effective number of markers, Mf,, is introduced here. Intuitively, 
as markers are often in linkage disequilibrium, the real number 
of "independent" markers is smaller than the total number of 
the markers genotyped. This concept was previously introduced 
under the context of risk prediction (Purcell et al., 2009), and 
Me was evaluated using Monte Carlo simulation. As indicated 
in Supplementary Note II, l/var{Q,y) is the mathematical expec- 
tation of the effective number of markers evaluated under the 
simulation method (PurceU et al, 2009). For example, for 100 
equifrequent biallelic loci, if the correlation for each pair of con- 
secutive markers is 0, 0.25, 0.5, and 0.75, the effective number of 
markers is approximately 100, 90, 61, and 29, respectively. Real 
GWAS data are often at a magnitude of 10* (Vinkhuyzen et al, 
2013). 

Tfie derivation ofE (Kil>fft) 

The expected phenotype of y,- given genotype Xik depends on 
the QTL genotype, say the locus, in LD with xik- Assuming a 
biallelic QTL in LD with the marker, the conditional expectation 
of the marker is E {y,\Xjk = AkAk) = T,^jg,Siip(xii\x,k = AkAk), 
in which gi = {QiQi, Qiqi, qiqi], and E{yi\xik = AkAk) = 
fi,xRl + Ox 2Rki (1 - Rki) - A X (1 - Rkif = (2Rki - 1) Pi- 
Analogously, we can derive the expected values of 
E {y,\x,k = AkUk) = (Rki - Tki) Pi and E (y;|x,jt = UkUk) = 
{1 — Irki) Pi (See Table 4). Once E{yi\xik) is defined, the 
distribution of E(Yij\x,k, Xjk) can be tabulated as in Table 5. 

DERIVING THE MATHEMATICAL EXPECTATION OF THE REGRESSION 
COEFFICIENT 

In this section, we investigated two scenarios to derive the 
expected value of the regression coefficient for Equation (1). In 
scenario I, genetic similarity is estimated at a single marker, which 
is in LD with one or more QTLs. In scenario II, genetic similarity 
is estimated based on M markers, each of which can be in LD with 
L QTLs. 

Scenario I: one marker and one QTL 

Under the scenario of one marker, say the fc* marker, 
and one QTL, say the QTL, since var {p,ijk) = I, 
E(b) = E{QijYy), which is E {nijY,j) = Y.^.i^i:xj;^SikSjk 

[E{yi\Xik) - E{yj\xjk)f p(Xik)pixjk). Consequently we can 
derive the regression coefficient as 



Table 2 | The joint distribution of two loci. 



The k"" locus 







3k 


AkSk 


AkAk 


The /* locus 


a,a, 




iPkQkrkiC^ - Flki) 


pIV - 






2(7^ r(;,(1 - rki) 


IpkQk [rkiRki + 0- rki) (1 - Rki)] 


2p|Rw(1 - Rki) 




AiAi 


qIO - rkif 


2p,,qt,RkiC\ - rki) 




Marginal probability 




4 


^PkQk 


pI 



Each cell lists the joint probability of a genotype pair at the k'^ and the locus, respectively. 

rill and Rj^i, as defined in Table SI, represent the conditional probabilities of the haplotypes ai^ai and A^Aj, respectively 
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Table 3A | The joint distribution of the genetic relatedness between individuals / and j. 



Genotype 



Individual / 

Sjii Frequency 



Genotype 



Individual j 



Frequency 



Relatedness for individuals / and j 



'■ijk 



Frequency 



-2pk 



3kSk 
Ak3k 
AkAk 



-2pk 

V^PkOk 
Qk-Pk 



■JT^kQk 
2qk 

V^PkOk 



4 



^PkQk 



Pi 



2PkQk 
-2pk{qk - Pk) 

2PkQk 
-^PkQk 

2PkQk 



2pkql 
f^k4 



AkSk 



Qk-Pk 
•J'i-Pkqk 



^PkQk 



akSk 

Akak 
AkAk 



-2pk 

sf^PkQk 
Qk-Pk 



V^PkQk 
2qk 



■JT^kQk 



4 



2PkQk 



Pi 



-2pk(qk - Pk) 

2pq 

(Qk - Pkf 

2pkqk 
2q(qk - Pk) 

^PkQk 



2pk4 

^PWk 
2plqk 



AkAk 



2qk 

■JT-P^qk 



Pi 



3k3k 
Akak 
AkAk 



-2pk 

V^PkOk 
Qk-Pk 



•J2pkQk 
2qk 

V^PkOk 



4 



2PkQk 



4 



-^PkQk 

2PkQk 
2q{qk - Pk) 

2PkQk 
H 
ipkQk 



pl4 

2plqk 



Pi 



As A was set as the reference allele, with a frequency of p, aa, Aa, and AA were coded as 0, 1, and 2, respectively. 



Table 3B | A reorganization of Table 3A to illustrate the relatedness joint distribution of a pair of individuals. 


^ ^pI -2pkiqk-Pk) -^PkQk (Qk-Pkf 
2ptqk 2pkqk 2p^q^ Ip^qk 


2qk(Qk - Pk) 
2PkQk 


Aql 

2Pkqk 


Frequency Ap^ql 2plql Aplql 


^plQk 


P\ 



Qjjk represents the product of a standardized genotype pair for individual i and individual j on the k'" locus. 



E{b) = E{QijYij) = i:^^i^'£^.f^StkSjk[E {y,\x,k) 

, m2 , X , N -2pk{qk-pk) 2o2^ 3 

-E(yj\xjk)\ p(x,k)p(xjk) = -^^ '-r^iPflpkql 

-2pk{qk-pk) , , ,2^2-,^ ,3 , -4mjc ,, 2^2^2,2 

+ T— (-^«) Pi ^Pklk + -T—'^T^klPlPklk 

Ipkqk 2pkqk 

,Z^mk ^ ^Igl^J. , ^'?''fa~^'^) ,2a2o^3, 

+ -ipm^^^-''^ ^'^^^^ + ipkqk 

, 2qk{qk-pk) ^ ^2o2,„3, 

H r- i-m) Pi2pi,qk 

2pkqk 



and 



E{b) = -Ar^iPkqkPf 



(6) 



in which rj:/ = 1 — r^i — Rki- 

When the QTL overlaps with the marker, or the correla- 
tion between the QTL and the marker is 1, E{b) = —ApkqkP^ 
because rjt; = Rti = 1 . When the QTL is in linkage equilibrium 
with the marker, rjt; = qi and R^i = pi, 1 — r^i — Rki = 0, and 
consequently = 0. 

According to Table SI, ru + Rki = (qi + ) + (pl + ) = 

1 + Consequently, the expression of Equation (6) can be 



rearranged as E{b) = - 
pair of biallelic loci pki 

4fi 



4 PkqkPf- The correlation of a 

_ Du 



[Equation (2)], and conse- 



quently E(b) = —2pljaf, in which cr/ = y/TpiqifHi. Alternatively, 



we can write 



E(b) 



(7) 



In the GWAS context, squared LD (Pearson's correlation) is 
in lieu of the recombination fraction for linkage. The mathe- 
matical expectation of the regression coefficient resembles the 
one in the original LIE regression. However, it should be noted 
that here the interpretation of the regression coefficient is based 
on linkage disequilibrium and association, whereas the origi- 
nal interpretation is based on linkage between the marker and 
the QTL. 

When multiple QTLs are in LD with the marker, the con- 
ditional expectation for y,- given is E {yi\xik = AkAk) = 
^f(2Rki - DA, E (y,\x,k = AkUk) = (J^h - rki)Pi, and 
E{yi\x,k = ik'^k) = — 2rki)Pi, respectively. The joint 
distribution of Q,ij and is as summarized in Table S2, 
which resembles Table 5. Still using E(^Q,,jYij) = 

'^x.k^xjkSikSjk [E {y,\x,k) - E {yj\xjk)f p (xik) pixjk), the regression 
coefficient can be derived as below. 
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Table 4 | The expected phenotype conditional to one's genotype on 
the observed marker. 



Marker 


QTL 


QTL conditional 


E{Yi\Xii,) 


genotype 


genotype 


probability 






QlQl 


,2 

l^kl 


0 - '^I'kdPi 




QlQl 


rki 0 - hi) 






QlQl 


rki (1 - rki) 






QiQi 


(1 - ru)^ 




Akak 


QlQl 


rwd - Flki) 


\nkl - rki) PI 




y,lQl 


I 1 - Tf;/) ( 1 - /-ff;/J 






QlQl 


hlf^kl 






QlQl 


(1 - hi) 




AkAk 


QlQl 


(1 - 


(2R«-1)/i/ 




QlQl 


(1 - fffc/) 






QlQl 


'^Jr/d - Flkl) 






QlQl 


f^ll 





It is assumed that the k'^ locus is the observed marker and the /"' locus is the 
QTL 

rjfi and R^i are the conditional probabilities of the coupling phases of the 
haplotypes as defined in Table SI 



E{b) = 



-2pk{qk-pk)r I -,2 3 

-[S,= iTH^;J Ipkqk 



+ 



2pkqk 
-'^pkqk 
2pkqk 



^i^LinifiifpUl 



+ 



2qk (qk - Pk) 



+ ■ 



+ 



2pkqk 

-2pk {qk - pk) 
2pkqk 

-^pkqk^v^L 



[^Li^klfilflplqk 



+ 



2pkqk 

2qk (qk - pk) 

2pkqk 



i[^l=l-^klfil]plql 



['^l=i-T:iclPlf^plqi< 



= -^pkqkl'^tl'^klPlf 
Equation (8) can be rearranged as 



E(b) = -lY^l^^^Y^l^^^pki^PkhOi^ai-,. 



(8) 



(9) 



It is easy to see that when L = 1, Equation (9) can be simphfied 
to Equation (7). 

Scenario II: multiple markers and multiple QTLs 

When the genetic relatedness matrix is constructed with M 
markers, each of which may be in LD with L QTLs, the HE 
regression becomes Yy = a + hQij, in which f2y = ^'E^StkSjk- 
For convenience, Qijk denotes the relatedness fraction constructed 
with the k'^ marker between the and the individuals. 



According to the definition of the regression coefficient, b = 

Cov(nij,Yij) 
var(Qij) ' 

1 / ^ \ J M 

cov Y,j) = -cov I ^ Q.,,k, ^] = -^Y.'°'' (^y^' 

\k=l I k=\ 

1 ^ 2 
k=\ 

var (f2y) = Y.f^ ^ Ej^ ^cov {Qyk, Xyi) /hf, in which 
cov (fJyVc, Xiji) = p^p as expressed in Equation (4). 



E{b) 



Sf= 1 - Apkqk [St ,^kM' 1 / f Sf= 1 



M 



M2 



After rearrangement 



E{b) = -2crjA + A 



(10) 
(11) 



in which o\ = Y,f2piqifif, A 



'A 



yM yL _2 
ML 

yM yM „2 
M2 



$ and 

Pm 



2pkli Pkl2 VP'i IhPh 1h sum- 

j;M j-L 2 

marizing the between-locus variance. Pg = '^='^^1=' is the 
average squared LD between a marker and a QTL across the 

j^M 2 

genome, and p^ = ''='j^^=' " is the averaged LD between every 
pair of markers, including the LD between each marker to itself. 
The interpretation of Equation (11) will be clear in Simulation 
III and Simulation IV. 

If the phenotype is standardized, heritabQity equals the addi- 
tive variance component. It is straightforward to obtain an esti- 
mate of the heritability for a single QTL, as in scenario I, or all 
QTLs, as in scenario II (See Supplementary Note IV) 



(12) 



THE SAMPLING VARIANCE OF THE REGRESSION COEFFICIENT 

The sum of square error (SSE) is 

SSE = var (7^) - P-var 

var (Yy) = 8<T^ (Supplementary Note III), and V'var (p^ij) = 
-4 

4(T^A^ var {^ij) = 4a^^ 

SSE = 8a^ - A.al. 

USE = SSE/Af', in which Af' = ^^^^ - (d -I- 1) and d is the 
number of the regression coefficient (here d = I). 



Ob - 



MSE 



8a^^-4a^^\ /{Af'varin,j)} (13) 
Pm 
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Table 5 | The joint distribution of E{Uij]andE[ V,y{x/, Xj) for one marker and one QTL. 



Individual / 



Genotypelx,) 



Sik 



-2pk 

-J^PkQk 
(1 - 2r^,) P 



Qk-Pk 
•J'^-PkQk 
(Flkl - I'kl) /*/ 



2qk 
V^PkOk 

(2R«-1)^/ 



Genotype(xy) 



Frequency 



2Pk<1k 



3k3k 



AkSk 



-2pk 
-/TpkOk 



q-p 



(1 -2r„)/i, 



{J^k\ - I'k!) fil 



4 



^PkQk 





-^PkiQk - Pk) 


-^PkQk ■ 


1^ 2pi;qk 


2pkqk 


2pkqk H 




(-rki)^Pf 


4(-r«)2^2 ■ 




2pk4 


44 m 



-2pkiqk - Pk) 


(Qk-Pkf 


2qk(Qk - Pk) ■ 


2PkQk 


2PkQk 


2PkQk ■ 


^2 fl2 
"^klPl 


m 0 


(-nifpf ■ 


2pkcrl 


"^pWk 


2plqk 



AkAk 



2q 

J2pq 



(2%- 



4 



s t represents the standardized genotypes of the k'^ locus. 

For the nine cells, the symmetrical cells are highlighted in same colon In each highlighted cell, three terms from the top to the bottom are ilij 
[E (K,ix,),) — E {yj\Xjk)Y ^""^ "^^"^ frequencies. 
T:kl= T - I'kl - Flkh 



-^Pkqk 


2qk{qk - Pk) 


H 


H 2Pkqk 


2pkqk 


2Pkqk 




^klPl 


0 




2Dla, ^ 


4 



■■ SikSjk, E(Yij\Xik,Xik) -- 



For scenario I, as only one marker is used, var (f2y) = 1 and 
fill = 1- 

a. = y j^, (14) 

Given the current GWAS data, which incorporates thousands of 
individuals and often up to one million markers, it is reasonable 
to assume TV" ^ Af = ^^^-^'> and 8M, » 4a^A^. 

I 16Me 4 / — 

(^b~J—, —^-Jm, (15) 

]jN(N-l) 

For real GWAS data with about one million markers, Me = 
— i-TT = -X- ranees from 30,000 to 50,000 markers due to the 

var(nij) pj^ & 

Strong LD pattern (Vinkhuyzen et al., 2013). 

When the phenotype is standardized, the sampling variance 
of the regression coefficient is half of the additive variance 
component. 

1 - 2 / — 
^h^ = ~ ^^^'^ ^^^^ 

THE MATHEMATICAL EXPECTATION OF THE HE REGRESSION 
INTERCEPT 

The expectation of the intercept is £ (7,^) = — y,) ] = 

£(rf ) + Eiyj) - lEiny,). E (yj) = var (y,) - E (y,)', £ {y]) = 

var (y,) - E (y,)^ E {y,y-) = cov (y,-, yj) + E (y,) £(y,). As the 
individuals are not related to each other, assuming no 



common environment, cov (y,, yj) = 0. So, E (7/,) = 
var (y^j + var (yj) = 20^ + Icr^ , twice the phenotypic vari- 
ance. The negative ratio between the regression coefficient 
and the intercept provides an estimate of the heritability if the 
phenotype is not standardized. 

The derived regression coefficients and their sampling variance 
at the completion of the derivation are summarized in Table 6. 

THE ADDITIVE VARIANCE COMPONENT STRUCTURE OF A 
QUANTITATIVE TRAIT WITHOUT ASCERTAINMENT 

The additive variance of a trait is defined as cr^ = E;^^ ^IpiqiPf + 
S/; ^1 Ef^ 2p;j ^PhqiipijqkPh ■ However, for a complex trait 
with polygenic genetic architecture, if the QTLs are randomly 
allocated along the genome, cr^ = 'Ef'^^lpiqifif (Supplementary 
Note V), a phenomenon that the between-locus covariances 
tradeoff This is often true for a trait without ascertainment or 
selection. When each QTL is tagged perfectly and randomly allo- 
cated along the genome, A = 1. Equation (11) zeros out the A 
term and directly gives the unbiased estimate of twice negative 
of the additive variance. Removing the scale makes the heri- 
tability estimate unbiased. In practice, due to imperfect LD, the 
heritability is reduced toh^A. 

In fact, the HE regression and the mixed model are equivalent 
and can agree on the heritability estimate (see Simulation III). 
However, this equivalence can be disturbed when QTL effects are 
not randomly distributed (Simulation IV). 

EXTENSION TO CASE-CONTROL GWAS DATA 

Like the debut application of the original HE regression for 
schizophrenia (Elston et al., 1973), the IBS HE regression is also 
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Table 6 | Summary of the derivations. 



Scenario 



One marker and one QTL 
One marker and multiple OTLs 

IVIultiple markers and multiple QTLs 



In genetic parameters 



Apkqk\_^^^T:klPl\ 
I 



Elb) 



M 



vM „2 



/=! Pkl 



/M2 



In statistical 
parameters 



-2crjA if QTLs are 
randomly allocated along 
the genome. 



Af' 
As above 



For E{b], the first expression is derived from the conditional probability and the second expression is for statistical neatness. 
When the phenotype is standardized, h^ = —0.5b and cf 1^2 = O.Sai,. 



extended to case-control GWAS data in this study. However, due 
to scale issues and ascertainment (Dempster and Lerner, 1950; 
Falconer, 1966), the estimated heritability needs to be trans- 
formed to the liability scale, which is genetically meaningful for 
ascertained samples. One transformation was proposed by Lee 
et al. (2011), denoted here as Hong23. It is expressed as = 
^o ^*z7^' P(i-P)' ' which is the heritability on the liability 
scale, h-l is the heritability on the observed scale directly estimated 
based on the case-control data, K is the prevalence of the disease, 
P is the proportion of the cases in the data, and z is the height 
of the standard normal distribution in which the prevalence K is 
located. 

Once the heritability is estimated by the HE regression on the 
observed scale with Hong23, it can be easily transformed from the 
observed scale to the liability scale. Simulation studies wiU be con- 
ducted to investigate whether the HE regression better estimates 
heritability than does the mbced linear model (Simulation IV). 

In addition, Y in the HE regression can also be expressed as a 
cross-product, and then E (b) = —Ipkqi^r^iPf , which is half that 
of Equation (7) (See Supplementary Note VI). 

MONTE CARLO SIMULATION RESULTS 

In the Monte Carlo simulation, we will investigate the precision 
of the derived equations. 

SIMULATION I: ONE MARKER AND ONE QTL [EVALUATION OF 
EQUATION (7)] 

This simulation investigated the accuracy of Equation (7) for 
a single-marker application. One thousand unrelated individu- 
als were simulated. One marker and one QTL were simulated, 
both of which were equifrequent and biallelic. The heritabil- 
ity of the QTL was 0.5. The LD between the marker and the 
QTL was set at three levels: p = 0.25, p = 0.5, and p = 0.75. 
The single marker was used to construct the genetic related- 
ness, Q,. Then a single-marker-based HE regression was con- 
ducted. After standardizing the phenotype, the negative half 
of the regression coefficient returned the unbiased heritability 
estimate. 

As indicated by Equation (7), given p = 0.25, p = 0.5, and 
p = 0.75, the regression coefficient expectation was —0.062, 



0.125, and 0.57, respectively. After 100 rounds of simulation, the 
derived expectation of the regression coefficient, as well as the 
sampling variance (Table 6), were in good agreement with the 
simulation results listed in Table 7. This simulation indicates that 
the single-marker HE regression is a competitive tool for QTL 
mapping. 

SIMULATION II: STATISTICAL POWER OF THE SINGLE-MARKER HE 
REGRESSION 

For the single-marker HE regression, as the expectation and 
the sampling variance of the regression coefficient were already 

derived, a f-test could be constructed as f = -jj^ ' in which 
the linkage disequilibrium between the fc* marker and the l'^ 
QTL is Pkl. When the sample size is sufficiently large, the t- 
test approaches the z-score distribution, and the non-centrality 

parameter of /j is consequently n'^-^ ■ Nh^p^^ ~ a x^-test 
with one degree of freedom. In Table 8, the required sample size 
to detect association with a SNP for a GWAS (type-I error rate of 
10^^) and the required sample size to detect a QTL are indicated. 

In contrast, for a conventional one-marker association lin- 
ear regression, y; = 11 + hkSik + e;, if the phenotype and the 
genotypes are both standardized, E (bk) = PkiPi, and its stan- 

dard error is cr^j = / a f-test can be constructed as f = 

PklPl/^/j^i^ = i-^h^p2^ - Taking the square of the f statistic, 
the non-centrality parameter of Xi is ^'\ip2 ~ ^^^Pkl ~ ^i- 

These two tests differ by the factor N^'-^. Once N > 
-j^TpT' single-marker HE regression is more powerful than the 

conventional liner regression; otherwise, the conventional linear 
regression method is more powerful. As listed in Table 9, given 
that the heritability of a QTL is 0.01, if the LD between the 
target marker is low {pki = 0.25), medium {pki = 0.5), or high 
{pkl = 0.75), the sample size required to allow HE to outperform 
the linear regression is 6400, 1600, and 712, respectively. If the 
heritability is even smaller, say = 0.001, the required sample 
size is 12,800, 3200, and 1423 to make the HE regression more 
powerful under the low, medium, and high LD, respectively. 
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Depending on the sample size, heritability, and LD patterns 
between the QTL and the target marker, the power of the HE 
regression may or may not be greater than the conventional linear 
regression. However, when the sample size is large, or the heri- 
tability of the QTL is large, HE regression is a more powerful tool 
for association studies. These results are based on the assumption 
that the real sampling variance agrees with the derived theoretical 
result. 

SIMULATION III: THE ALL-MARKER HE REGRESSION AND THE MIXED 
LINEAR MODEL ARE EQUIVALENT [A = 0 IN EQUATION (11)] 

In this simulation, 100 equifrequent and biallelic QTLs were sim- 
ulated, and the additive effect of each QTL was sampled from 
JV (0, 1). Four LD levels {pi^j^ = 0, 0.25, 0.5, 0.75) were adopted 
for each of two consecutive QTLs, and the effective number of 
markers decreased correspondingly (M^, ~ 100, 90, 61, 29). One 
thousand unrelated individuals were simulated, and the genetic 
relatedness of each pair of individuals was estimated on these 
100 QTLs. The heritability of the simulated polygenic model was 



Table 7 | Simulation evaluations of Equation (7). 


LD 


Analytical results^ 


Simulation results'* 


p = 0.25 


-0.062 (0.004) 


-0.062 (0.0039) 


p = 0.5 


-0.25 (0.004) 


-0.25 (0.0039) 


p = 0.75 


-0.56 (0.004) 


-0.56 (0.0039) 



" The standard error was calculated: at = yj =s ^ ~/Me. Here N = 1000 

and Me = 7. 

''The standard errors in parentheses indicate the mean of the standard error 
from 100 simulation replications. 



Table 8 | The sample size required for the single-marker HE regression 
to detect a QTL associated with the target marker. 







Pkl 






0.25 


0.5 


0.75 


0.005 


33,276 


8,319 


3,697 


0.01 


16,638 


4,159 


1,849 


0.025 


6,655 


1,664 


739 


0.05 


3,327 


832 


370 


Here the p-value cutoff was 10 ®. 


Table 9 | The required sample size that makes the HE regression more 


powerful than the conventional single-marker linear regression. 








PkJ 






0.25 


0.5 


0.75 


0.005 


12,800 


3,200 


1,423 


0.01 


6,400 


1,600 


712 


0.025 


2,560 


640 


285 


0.05 


1,280 


320 


143 



0.5, which is calculated as = -^^^.Andcr^ = Y^^^^lpiqifij + 

Both the HE regression and the mixed linear model were 
employed to estimate the additive variance component. The 
mixed linear model (Yang et al., 2010) can be expressed as y, = 
fjL + Xijttj + e,-, where y, is the phenotype of the i'^ individual, /x 
is the mean, is the indicator variable with values of 0, 1, or 2 
depending on the reference allele counts, and Cj is the residual. 
Restricted maximum likelihood (REML) was employed to esti- 
mate the variance components of the mixed linear model (Yang 
etal, 2010). 

As shown in Table 10, the estimated heritability from either 
the HE regression or the mixed linear model was equal and not 
biased, demonstrating the equivalence between the HE regres- 
sion and the mixed linear model when the QTLs are randomly 
distributed regardless of their pairwise LD. 

E{h) = — 2<T^A (ignoring A) sheds light on the inference of 
the general LD pattern between the tagged markers and the causal 

-2 

variance. A = ^, and pIj can be estimated from markers. If the 

Pm 

heritability of the trait is known (not likely though), it is possible 
to estimate Pq. For example, the heritability of height is estimated 
at around 0.8 (Visscher et al., 2006; Perola et al., 2007) in linkage, 
but is 0.4 as estimated in an association study (Yang et al, 2010). 
If the estimate from linkage was considered to be the true her- 
itability, A = 0.5. Assuming the effective number of markers is 
Mc = 10, 000, = 0.0001, p^ = Apli = 0.00005. The average 
absolute value of the LD between a QTL and a marker is 0.007. 

SIMULATION IV: WHEN THE HE REGRESSION AND THE MIXED LINEAR 
MODEL ARE NOT EQUIVALENT [WHEN A #OIN EQUATION (11)] 

The general setting for this simulation was similar to the last 
one, but the QTL effects were sorted such that the addi- 
tive effects were increased along the simulated chromosomal 
segment. The covariance between any two QTLs can be pre- 
dicted by cov(Q;j, ) = P/i ,;2 VP'i IhPk H-h Ph Ph ■ The heritabil- 
ity is defined as = ^i^^i » in which cr^ = 'Ef^^lpiqifif + 
S,^ ^ 1 Sf^ ^ ,j 2p(j ^PhqiiPhlhPh Ph ■ Different from the last sim- 
ulation, A = -2Y,^^^{i:f^^^i:f^_^,2pa,pu2m,ffh)/M / o. 

With this set-up, which is not likely to be true in practice 
but illustrates an extreme case, the HE regression and the mixed 



Table 10 | Simulation evaluations of Equation (11) and comparison 
between the HE regression and the mixed linear model method 
(A = 0) . 



LD (,,) 


Equation (11)° 


HE results'" 


Mixed model results'^ 


p = 0 


0.5 (0.020) 


0.499 (0.020) 


0.499 (0.041) 


p = 0.25 


0.5 (0.019) 


0.500 (0.019) 


0.501 (0.042) 


p = 0.5 


0.5 (0.016) 


0.502 (0.015) 


0.491 (0.043) 


p = 0.75 


0.5 (0.011) 


0.488 (0.011) 


0.508 (0.048) 



" Calculated given A = 0. 

''•"The standard errors in parentheses indicate the mean of the standard error 
from 100 simulation replications. 
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Table 11 | Simulation evaluations of Equation (11) when the 
covariance summation is not zero (A ^ 0). 



LD (/)) 


Equation (11)^ 


HE results'* 


Mixed model results'^ 


p = 0 


0.500 (0.020) 


0.497 (0.020) 


0.499 (0.041) 


p = 0.25 


0.715 (0.019) 


0.712 (0.019) 


0.414(0.041) 


p = 0.5 


0.853 (0.015) 


0.850 (0.015) 


0.347 (0.043) 


p = 0.75 


0.878 (0.011) 


0.881 (0.011) 


0.291 (0.048) 



" Calculated given A ji 0. 

^■'^The standard errors in parentheses indicate the mean of the standard error 
from WO simulation replications. 



model gave very different estimations. With increased correla- 
tion between markers, the HE gave inflated estimates and the 
mixed model gave deflated estimates. Although both methods 
gave biased estimates, Equation (11) still could predict the results 
of the HE regression correctly (See Table 11). 

SIMULATION V: APPLICATION TO CASE-CONTROL DATA 

The HE regression was applied to case-control data. A poly- 
genic model of L equifrequent diallelic QTLs was simulated, and 
each locus was in Hardy- Weinberg equilibrium and any pair of 
QTLs was in linkage equilibrium. The heritability on the liabil- 
ity scale was hj, the heritability on the liability scale. The effect 
of each QTL was sampled from N(0, a^), and cr^ = /[2 x p x 
(l — p) X L], in which p = 0.5. The phenotype of each individ- 
ual under the liability scale was scaled to unit. The ascertainment 
of cases on the liability scale was K. Individuals were sampled 
from the described reference population until 1000 cases and 
1000 controls were recruited. 

The heritability on the liability scale was 0.5. In order to cover 
a broad range of scenarios, three levels of QTL number, L = 100, 
1000, and 10,000, and three levels of disease prevalence at the 
population level, K = OA, 0.01, and 0.001, were adopted. Nine 
scenarios were simulated in total, and 30 independent simulation 
replications were implemented for each scenario. 

The genetic relationship matrix was constructed using all indi- 
viduals and the allele frequencies were estimated from the sample. 
The genetic additive variance components were estimated with 
the HE regression and the mixed model method. As the directly 
estimated variance component was on the observed scale and 
could be greater than 1, we employed both the REML and non- 
constrained REML for mixed model methods, which allowed the 
heritability to be greater than 1. 

As illustrated in Figure 1, the estimated was compared 
across all three methods. In general the HE regression resulted 
in a more precise estimate than that of the REML and non- 
constrained REML. For the mixed model methods, either with 
or without constraints, REML often underestimated the variance 
components. The bias was caused by two factors: the number of 
QTLs (in each row panel) and the prevalence of the disease (in 
each column panel). With fewer QTLs, a lower prevalence could 
exacerbate underestimation by the mixed model. 

CONCLUSION 

The analytical results summarized in Table 6 were evaluated using 
Monte Carlo simulation, and were highly precise in general. The 



single-marker HE regression is a competitive tool for QTL map- 
ping, particularly with a large sample size (Simulations I, II). The 
HE regression and the mixed model method were equivalent, with 
both providing a precise heritability estimate for a typical poly- 
genic trait (Simulation III). However, if QTL effects are correlated, 
neither the HE regression nor the mixed model method gave an 
unbiased estimate (Simulation IV). For case-control studies, the 
HE regression should be preferred in general (Simulation V). 

GENETIC ANALYSIS REPOSITORY (GEAR) 

In order to facilitate application of the HE regression method to 
estimate complex trait heritability, GEAR software was developed. 
GEAR was developed on Java and can run across many operat- 
ing systems, such as Windows, Mac, and Linus/Unix, as long as a 
Java virtual machine is available. GEAR has been demonstrated to 
function in the following situations. 

(1) It can generate genetic relatedness of unrelated individu- 
als, as formulated in Equation (3), based on whole-genome 
markers. 

(2) It can estimate the effective number of markers based on a 
genetic-relatedness matrix. 

(3) It can estimate heritability with the HE regression. GEAR can 
read genotype data saved in PLINK binary format (Purcell 
et al, 2007). 

GEAR can be downloaded from the website: https://sourceforge. 
net/ projects/ gbchen/files/GEAR/ 

The online GEAR manual can be found at https://sourceforge. 
net/p/gbchen/wiki/GEAR/ 

DISCUSSION 

Historically, linkage was the major tool for QTL mapping of 
complex traits since the 1970s, which was gradually replaced by 
association analysis when GWAS became popular (The Wellcome 
Trust Consortium, 2007). The transmission/disequilibrium test 
(TDT; Ott, 1989; Spielman et al, 1993) triggered the transi- 
tion from linkage to association for family-based studies. In the 
year 2000, generalized TDT was proposed (Laird et al., 2000), 
which is robust for population stratification. Shortly after that, 
population-based design emerged as the major flow in genetic 
data, and GWAS became the leading method for estimating her- 
itability up until now. Extension of the original HE regression to 
association studies can be seen as an effort to increase the diversity 
of GWAS analysis tools. 

In this study, we established a theory for a modified HE regres- 
sion, in which IBS scores replace IBD scores. Although IBS is 
used to detect IBD in linkage studies (Lange, 1986a,b; Bishop 
and Williamson, 1990), it is considered to be a way of inferring 
IBD for relatives, such as sib pairs, when founder genotypes are 
unavailable. In this study, IBS served as the key concept to detect 
association of unrelated samples rather than relatives. Linkage 
and association have both been proposed to estimate heritabil- 
ity of complex traits. For example, for height, the heritability 
estimated from linkage studies is around 0.8 (Visscher et al., 
2006; Perola et al., 2007), but around 0.4 from association studies 
(Yang et al., 2010). Thus, far there is no clear conclusion regard- 
ing the fundamental difference between the heritability estimated 
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100 QTLs 




1,000 QTLs 



6 



CM 






REML 
- NC-REML 
■ LSE 



K = 0.1 



1 

K = 0.01 



1 

K = 0.001 



10,000 QTLs 



(M _ m 






K = 0.1 



K = 0.01 



K = 0.001 



FIGURE 1 I Estimation of heritability on the liability scale using 
the HE regression and mixed linear model methods. In eacli row, 
from left to right, each panel represents the case-control sample 
simulated under the same heritability on the liability scale (h^) but 
with different prevalence. In each panel, the vertical axis indicates 



horizontal axis indicates which of the three methods (REML, 
non-constrained REML, and HE regression [least square estimate]) 
was used. The standard error of the mean (SEM) is indicated at 
the top of each bar. 



from these two kinds of methods. Despite their mathematical 
similarity, application, and interpretation differences should be 
appreciated. 

Under various scenarios, the mathematical expectations of the 
regression coefficients, as well as the sampling variances, were 
derived. There is substantial mathematical similarity between the 



IBD HE regression and the IBS HE regression. For example, for 
both models under the single-marker scenario, their regression 
coefficients can be expressed in a unified form, E{b) = —Ip^aj^. 
As these two models are based on different genetic mecha- 
nisms, the interpretations of their respective regression coeffi- 
cients are reasonably different. In the IBD-based HE regression, 
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E{b) = — 2 (1 — 2c)^ cr^^, 1 — 2c ranges from 0 to 1; whereas 
in the IBS-based HE regression, E{b) = —Ix^fi'^^, in which t^, 
ranges from —1 to 1. As the values of r and R rely upon the 
allele frequencies of the biallelic marker and the biaUelic QTL, 
they reach either —1 or 1 only given that the marker has the 
same allele frequency as that of the QTL. However, after tak- 
ing the square, both (1 — Ic)^ and lie between 0 and 1, 
inclusive. 

Equation (11), E(b) = —2a^A (ignoring A), provides a pos- 
sible way to estimate the LD pattern between causal loci and 
markers. If the true heritability is not readily known, it is pos- 
sible to estimate Pq, the average LD between QTLs and markers. 
As demonstrated in simulation III, it may be possible to estimate 
Pq. However, the causal loci can be in any possible form, such 
as SNPs, chromatin markers, or methylation markers, and dif- 
ferent methods capture genetic variation in different forms. In 
practice, the obstacle in estimating Pg lies in how heritability esti- 
mated from different methods, such as linkage and association, 
or genotyping platforms, such as SNP markers and methylation 
markers, can be connected to each other. Equation (11) sheds 
light on the investigation for how QTLs are distributed along the 
genome. 

Application of the HE regression to heritability estimation of 
complex traits revealed that the HE regression seems to be equiva- 
lent to the mixed model approaches in general (A = 0). A similar 
equivalence was previously established for linkage analysis (Sham 
and Purcell, 2001). However, for GWAS data, it should be noted 
that the equivalence is conditional on the genetic architecture of a 
trait. As indicated in the simulation, the equivalence stands only 
for typical polygenic genetic architecture, which may be true for 
many traits without ascertainment or selection, such as height 
(Yang et al., 2010). However, when substantial covariance exists 
between causal loci, the equivalence does not stand and neither 
the HE regression nor the mixed model method gave unbiased 
estimates. The equivalence may break down under other circum- 
stances that have not been investigated. In real studies, this kind 
of covariance may be a result of selection in active regions, such 
as HLA loci, which harbors many signals; then, the HE regres- 
sion and the mixed model estimates may differ. The equivalence 
may break down under other circumstances that have not been 
investigated in this study. 

In GWAS, many samples are collected for complex diseases, 
which are often in a case-control design. Complex disease preva- 
lence is often low; consequently, the cases are under strong 
ascertainment, which disrupts the assumptions underlying the 
mixed linear model. As observed in the simulation studies, the 
HE regression is more precise in estimating heritability than the 
mixed linear model for case-control studies across a broad range 
of scenarios. Use of HE regression is advantageous when the dis- 
ease prevalence is low and the number of causal loci is few. In their 
original work, Lee et al. (2011 ) assumed an infinitesimal model of 
complex diseases. However, when this assumption was disrupted 
during simulation (likely in practice as well), the mixed linear 
model method gave biased estimates of heritability. Thus, when- 
ever possible, the HE regression method is preferable to estimate 
heritability of complex traits. 



As derived in this work, the HE regression and the mixed 
model method are equivalent under polygenic genetic architec- 
ture. In other words, when the estimates generated by these two 
methods significantly differ for the same data, caveats should be 
presented. As investigated in the simulation, the real heritability 
may lie between the estimates of these two methods. Speed 
et al. (2012) previously investigated the assumptions underly- 
ing the mixed model method and proposed alternative weighting 
methods to adjust the heritability estimation. However, as their 
weighting method depends on genetic architecture, which is 
often unknown, it is difficult to justify which weighting method 
is appropriate to adopt for certain data (Gusev et al., 2013). 
Thus, simply comparing the estimates from the HE regression 
and the mixed model method may offer an alternative way of 
justification. 

It should be noted that the HE regression method is on the 
basis of the least square framework rather than the maximal 
likelihood framework as many mixed model based on (Yang 
et al., 2010; Speed et al, 2012; Lee et al., 2013). As a numerical 
method, maximal likelihood methods give estimates optimizing 
the likelihood under the assumptions, which may break down in 
practice. Given recent interests in comparing estimates with or 
without imputation for the genome (Gusev et al., 2013), con- 
troversial results have been observed. It is not sure what the 
increased or decreased estimation of heritability indicates after 
imputation. A reasonable guess will be that the local covariance 
structure, as indicated in Equation (11), changes and eventually 
bring out different estimates. The proposed IBS HE regression, 
which depends on fewer assumptions compared with maximal 
likelihood methods, may help melt the controversy. 

In practice, undocumented relatedness may creep into sam- 
ples, and eventually bring about suspiciously high relatedness. As 
discussed previously (Powell et al., 2010), Equation (3) gives a 
score of 0 for a pair of unrelated individuals, 0.5 for first-degree 
relatives, and 1 for duplicated individuals or monozygotic twins. 
It seems easy to eliminate related individuals if a cutoff, say a relat- 
edness of less than 0.05, is applied to the sample. For association 
studies, population stratification may increase false positive rates. 
To reduce the threat of population stratification, phenotypes can 
be adjusted by principal components (Price et al., 2006) and then 
fit into the HE regression. If a sample is admixed, the power of 
the HE regression may be reduced if, in the ancestral popula- 
tions, the allele frequency spectrums are different from each other 
or genetic heterogeneity exists in the genetic architecture of the 
underlying trait in question. More investigation will be required 
to overcome this challenge. 

The variance components have often been estimated via REML 
(Yang et al., 2010; Lee et al., 2011). Given its various merits, 
REML is computationally expensive, particularly for large sam- 
ple sizes. The computational complex is on the scale of O(tiV^), 
which indicates that it is cubic to the sample size and t rounds 
of iterations. The time complex of the HE regression is far lower, 
asymptotically 0(2N^), given two parameters, the intercept and 
the regression coefficient, included in the model. Given the large 
sample sizes often employed in GWAS, the computational burden 
can be dramatically reduced. Although the HE regression method 
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is derived on a simple-regression scenario, its extension to a 
multiple-regression scenario is straightforward. For instance, the 
genetic relatedness between each pair of individuals can be con- 
structed on each chromosome and then all chromosome-based 
relatedness scores can be fit into the regression framework. In 
addition, the difference between a pair of phenotypes can also be 
expressed as a cross product and squared sum (Sham and Purcell, 
2001). 
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